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Preface 



The proceedings of ECML /PKDD 2004 are published in two separate, albeit in- 
tertwined, volumes: the Proceedings of the 15th European Conference on Machi- 
ne Learning (LNAI 3201) and the Proceedings of the 8th European Conferences 
on Principles and Practice of Knowledge Discovery in Databases (LNAI 3202). 
The two conferences were co-located in Pisa, Tuscany, Italy during September 
20-24, 2004. 

It was the fourth time in a row that ECML and PKDD were co-located. Af- 
ter the successful co-locations in Freiburg (2001), Helsinki (2002), and Cavtat- 
Dubrovnik (2003), it became clear that researchers strongly supported the orga- 
nization of a major scientific event about machine learning and data mining in 
Europe. 

We are happy to provide some statistics about the conferences. 581 different 
papers were submitted to ECML/PKDD (about a 75% increase over 2003); 280 
were submitted to ECML 2004 only, 194 were submitted to PKDD 2004 only, and 
107 were submitted to both. Around half of the authors for submitted papers are 
from outside Europe, which is a clear indicator of the increasing attractiveness 
of ECML/PKDD. 

The Program Committee members were deeply involved in what turned out 
to be a highly competitive selection process. We assigned each paper to 3 re- 
viewers, deciding on the appropriate PC for papers submitted to both ECML 
and PKDD. As a result, ECML PC members reviewed 312 papers and PKDD 
PC members reviewed 269 papers. We accepted for publication regular papers 
(45 for ECML 2004 and 39 for PKDD 2004) and short papers that were asso- 
ciated with poster presentations (6 for ECML 2004 and 9 for PKDD 2004). The 
global acceptance rate was 14.5% for regular papers (17% if we include the short 
papers). 

The scientific program of ECML/PKDD 2004 also included 5 invited talks, a 
wide workshop and tutorial program (10 workshops plus a Discovery Challenge 
workshop, and seven tutorials) and a demo session. 

We wish to express our gratitude to: 

— the authors of the submitted papers; 

— the program committee members and the additional referees for their excep- 
tional contribution to a tough but crucial selection process; 

— the invited speakers: Dimitris Achlioptas (Microsoft Research, Redmond), 
Rakesh Agrawal (IBM Almaden Research Center) , Soumen Chakrabarti (In- 
dian Institute of Technology, Bombay), Pedro Domingos (University of Wa- 
shington, Seattle), and David J. Hand (Imperial College, London); 

— the workshop chairs Donato Malerba and Mohammed J. Zaki; 

— the tutorial chairs Katharina Morik and Franco Turini; 

— the discovery challenge chairs Petr Berka and Bruno Cremilleux; 
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— the publicity chair Salvatore Ruggieri; 

— the demonstration chairs Rosa Meo, Elena Baralis, and Codrina Lauth; 

— the members of the ECML/PKDD Steering Committee Peter Flach, Luc De 
Raedt, Arno Siebes, Nada Lavrac, Dragan Gamberger, Ljupco Todorovski, 
Hendrik Blockeel, Tapio Elomaa, Heikki Mannila, and Hannu T.T. Toivonen; 

— the members of the Award Committee, Michael May and Foster Provost; 

— the workshops organizers and the tutorialists; 

— the extremely efficient Organization Committee members, Maurizio Atzori, 
Miriam Baglioni, Sergio Barsocchi, Jeremy Besson, Francesco Bonchi, Stefa- 
no Ferilli, Tiziana Mazzone, Mirco Nanni, Ruggero Pensa, Simone Puntoni, 
Chiara Renso, Salvatore Rinzivillo, as well as all the other members of the 
KDD Lab in Pisa, Laura Balbarini and Cristina Rosamilia of L&B Studio, 
Elena Perini and Elena Tonsini of the University of Pisa; 

— the great Web masters Mirco Nanni, Chiara Renso and Salvatore Rinzivillo; 

— the directors of the two research institutions in Pisa that jointly made this 
event possible, Piero Maestrini (ISTI-CNR) and Ugo Montanari (Diparti- 
mento di Tnformatica); 

— the administration staff of the two research institutions in Pisa, in particu- 
lar Massimiliano Farnesi (ISTI-CNR), Paola Fabiani and Letizia Petrellese 
(Dipartimento di Informatica) ; 

— Richard van de Stadt (www.borbala.com) for his efficient support to the 
management of the whole submission and evaluation process by means of 
the CyberChairPRO software; 

— Alfred Hofmann of Springer for co-operation in publishing the proceedings. 

We gratefully acknowledge the financial support of KDNet, the Pascal Net- 
work, Kluwer and the Machine Learning journal, Springer, the Province of Luc- 
ca, the Province of Pisa, the Municipality of Pisa, Microsoft Research, COOP, 
Exeura, Intel, Talent, INSA-Lyon, ISTI-CNR Pisa, the University of Pisa, the 
University of Bari, and the patronage of Regione Toscana. 

There is no doubt that the impressive scientific activities in machine lear- 
ning and data mining world-wide were well demonstrated in Pisa. We had an 
exciting week in Tuscany, enhancing further co-operations between the many re- 
searchers who are pushing knowledge discovery into becoming a mature scientific 
discipline. 



July 2004 Jean-Frangois Boulicaut, 

Floriana Esposito, 
Fosca Giannotti, 
and Dino Pedreschi 
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Abstract. We show how carefully crafted random matrices can achieve 
distance-preserving dimensionality reduction, accelerate spectral compu- 
tations, and reduce the sample complexity of certain kernel methods. 



1 Introduction 

Given a collection of n data points (vectors) in high-dimensional Euclidean space 
it is natural to ask whether they can be projected into a lower dimensional 
Euclidean space without suffering great distortion. Two particularly interesting 
classes of projections are: i) projections that tend to preserve the interpoint 
distances, and ii) projections that maximize the average projected vector length. 

In the last few years, distance-preserving projections have had great impact in 
theoretical computer science where they have been useful in a variety of algorith- 
mic settings, such as approximate nearest neighbor search, clustering, learning 
mixtures of distributions, and computing statistics of streamed data. 

The general idea is that by providing a low dimensional representation of the 
data, distance-preserving embeddings dramatically speed up algorithms whose 
run-time depends exponentially in the dimension of the working space. At the 
same time, the provided guarantee regarding pairwise distances often allows one 
to show that the solution found by working in the low dimensional space is a 
good approximation to the solution in the original space. 

Perhaps the most commonly used projections aim at maximizing the average 
projected vector length, thus retaining most of the variance in the data. This 
involves representing the data as a matrix A, diagonalizing A = U DV, and pro- 
jecting A onto subspaces spanned by the vectors in U or V corresponding to the 
largest entries in D. Variants of this idea are known as Karhunen-Loeve trans- 
form, Principal Component Analysis, Singular Value Decomposition and others. 

In this paper we examine different applications of random matrices to both 
kinds of projections, all stemming from variations of the following basic fact: if 
R is an n x n random matrix whose entries are i.i.d. Normal random variables, 
N(0, 1), then the matrix -)= R is very close to being orthonormal. 



2 Euclidean Distance Preservation 

A classic result of Johnson and Lindenstrauss [7] asserts that any set of n points 
in can be embedded into R fc , with k = 0(log n), so that all pairwise distances 
are maintained within an arbitrarily small factor. More precisely, 
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Lemma 1 ([7]). Given 0 < e < 1 and an integer n, let k be a positive integer 
such that k > ko = (12/e 2 ) log n. For every set P of n points in there exists 
/ : — > R fc such that for all u,v € P 

( l - e ) ll ^-^ M 2 < ||/(^) -/(^)|| 2 < (l + e )||^^ ^|| 2 . 

Perhaps, a naive attempt to construct an embedding as above would be to 
pick a random set of k coordinates from the original space. Unfortunately, two 
points can be very far apart while differing only along one original dimension, 
dooming this approach. On the other hand, if (somehow) for all pairs of points, 
all coordinates contributed “roughly equally” to their distance, such a sampling 
scheme would be very natural. This consideration motives the following idea: first 
apply a random rotation to the n points, and then pick the first k coordinates as 
the new coordinates. The random rotation can be viewed as a form of insurance 
against axis alignment, analogous to applying a random permutation before 
running Quicksort. 

Of course, applying a random rotation and then taking the first k coordinates 
is equivalent to projecting the n points on a uniformly random fc-dimensional 
subspace. Indeed, this is exactly how the original proof of Lemma 1 by Johnson 
and Lindenstrauss proceeds: to implement the embedding, multiply the n x d 
data matrix A with a random d x k orthonormal matrix. Dasgupta and Gupta [5] 
and, independently, Indyk and Motwani [6] more recently gave a simpler proof of 
Lemma 1 by taking the following more relaxed approach towards orthonormality. 

The key idea is to consider what happens if we multiply A with a random dxk 
matrix R whose entries are independent Normal random variables with mean 0 
and variance 1, i.e., IV(0, 1). It turns out that while we do not explicitly enforce 
either orthogonality or normality in R , its columns will come very close to having 
both of these properties. This is because, as d increases: (i) the length of each 
column- vector concentrates around its expectation as the sum of d independent 
random variables; (ii) by the spherical symmetry of the Gaussian distribution, 
each column- vector points in a uniformly random direction in R d , making the 
k < d independent column-vectors nearly orthogonal with high probability. 

More generally, let I? be a random matrix whose entries are independent 
random variables with E(r'y) = 0 and Var (r^) = 1. If / : R d — > R fe is given by 

/0) = -j= xR , 

it is easy to check that for any vector x £ we have E(||/(x)||) = ||a;||. 
Effectively, the squared inner product of x with each column of R acts as an 
independent estimate of ||a;|| 2 , making ||/(a;)|| 2 the consensus estimate (sum) of 
the k estimators. Seen from this angle, requiring the k vectors to be orthonormal 
simply maximizes the mutual information of the k estimators. For good dimen- 
sionality reduction, we also need to minimize the variance of the estimators. 

In [1], it was shown that taking = ±1 with equal probability, in fact, 
slightly reduces the number of required dimensions k (as the variance of each 
column-estimator is slightly smaller). At the same time, and more importantly, 
this choice of r,j makes / a lot easier to work with in practice. 
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3 Computing Low Rank Approximations 

Given n points in represented as an n x d matrix A , one of the most common 
tasks in data analysis is to find the “top k” singular vectors of A and then project 
A onto the subspace they span. Such low rank approximations are used widely 
in areas such as computer vision, information retrieval, and machine learning to 
extract correlations and remove noise from matrix-structured data. 

Recall that the top singular vector of a matrix A is the maximizer of 1 1 Aar 1 1 2 
over all unit vectors x. This maximum is known as the L 2 norm of A and the 
maximizer captures the dominant linear trend in A. Remarkably, this maximizer 
can be discovered by starting with a random unit vector x £ and repeating the 
following “voting process” until it reaches a fixpoint, i.e. , until x stops rotating: 

— Have each of the n rows in A vote on candidate x, i.e., compute y = Ax £ R". 

— Compose a new candidate by combining the rows of A, weighing each row 

A T y 

by its enthusiasm for x, i.e., update x <— ' £ R d . 

\\ A V II 

The above idea extends to k > 1. To find the fc-dimensional invariant sub- 
space of A, one starts with a random subspace, i.e., a random d x k orthonormal 
matrix, and repeatedly multiplies by A T A (orthonormalizing after each mul- 
tiplication). Computing the singular row-vectors of A, i.e., the eigenvectors of 
B = A T A, is often referred to as Principal Component Analysis (PCA). The fol- 
lowing process achieves the exact same goal, by extracting the dominant trends in 
A sequentially, in order of strength: let Ao be the all zeros matrix; for i — 1, . . . , k: 

— Find the top singular vector, Xi, of A — Aj_i, via the voting process above. 

— Let A,j = Aj_i + Axixf , i.e., A is the optimal rank i approximation to A. 



To get an idea of how low rank approximations can remove noise, let G be 
annxd random matrix whose entries are i.i.d. iV(0,(7 2 ) random variables. We 
saw earlier that each column of G points in an independent, uniformly random 
direction in M™. As a result, when n is large, with high probability the d < n 
columns of G are nearly orthogonal and there is no low-dimensional subspace 
that simultaneously accommodates many of them. This means that when we 
compute a low rank approximation of A + G, as long as a is “not too large” (in 
a sense we will make precise) , the columns of G will exert little influence as they 
do not strongly favor any particular low-dimensional subspace. Assuming that 
A contains strong linear trends, it is its columns that will command and receive 
accommodation . 

To make this intuition more precise, we first state a general bound on the 
impact that a matrix N can have on the optimal rank k approximation of a 
matrix A, denoted by Ah, as a function of ||iVfc||. Recall that ||A||f = \ Yhi j A ij ■ 
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Lemma 2. For any matrices A and A, if A = A + A then 
II ^4 — Af~ U 2 < ||A — 1 1 2 + 2 1 1 iVfc 1 1 2 and 

|| A — A k \\F < ||A — Ak\\F + 1 1 Afc 1 1 F + 2-\/|| Afc||.p||Afc||F ■ 

Notice that all error terms above scale with ||Afc||. As a result, whenever A is 
poorly approximated in k dimensions, i.e., ||Afc|| is small, the error caused by 
adding A” to a matrix A is also small. 

Let us consider the norms of our Gaussian perturbation matrix. 

Fact 1 Let G be a random n x d matrix, where d < n, whose entries are i.i.d. 
random variables A( 0, a 2 ). For any e > 0, with probability 1 — l/poly(n, e), 

||G|| 2 = ||G fc || 2 < (2 + e) er\/n and ||G fe ||F < (2 + e) aVkn . 

Remarkably, the upper bound above for ||G|| 2 is within a factor of 2 of the lower 
bound Osjn on the L 2 norm of any n x d matrix with mean squared entry a 2 . 
In other words, a random Gaussian matrix is nearly as unstructured as possible, 
resembling white noise in the flatness of its spectrum. On the other hand, ||A|| 2 
can be as large as a\fdn for an n x d matrix A with mean squared entry a 2 . 

This capacity of spectral techniques to remove Gaussian noise is by now very 
well-understood. We will see that the above geometric explanation of this fact 
can actually accommodate much more general noise models, e.g. Ay that are 
not identically distributed and, in fact, whose distribution depends on Ay. In 
the next section, this generality will enable the notion of “computation-friendly 
noise”, i.e., noise that enhances (rather than hinders) spectral computations. 

Fact 1 also suggests a criterion for choosing a good value of k when seeking 
low rank approximations of a n x d data matrix A: 

II A — Afc || 2 ~ a^fn, where a 2 is the mean squared entry in A — Ak- 
in words: we should stop when, after projecting onto the top k singular vectors, 
we are left with a matrix, A — Afc , whose strongest linear trend is comparable to 
that of a random matrix of similar scale. 

3.1 Co-opting the Noise Process 

Computing optimal low rank approximations of large matrices often runs against 
practical computational limits since the algorithms for this task generally require 
superlinear time and a large working set. On the other hand, in many applica- 
tions it is perfectly acceptable just to find a rank k matrix C satisfying 

II A — C|| < || A — Afc || + 5 , 

where Afc is the optimal rank k approximation of the input matrix A, and S 
captures an appropriate notion of “error tolerance” for the domain at hand. 
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In [2] , it was shown that with the aid of randomization one can exploit such an 
“error allotment” to aid spectral computations. The main idea is as follows. 

Imagine, first, that we squander the error allotment by obliviously adding to 
A a Gaussian matrix G, as in the previous section. While this is not likely to 
yield a computationally advantageous matrix, we saw that at least it is rather 
harmless. The first step in using noise to aid computation is realizing that G is 
innocuous due precisely to the following three properties of its entries: 

independence, zero mean, small variance. 

The fact that the Gij are Gaussian is not essential: a fundamental result of 
Fliredi and Komlos [4] shows that Fact 1 generalizes to random matrices where 
the entries can have different, in fact arbitrary, distributions as long as all N,j :j 
are zero-mean, independent, and their variance is bounded by a 2 . 

To exploit this fact for computational gain, given a matrix A, we will create 
a distribution of noise matrices N that depends on A, yet is such that the ran- 
dom variables Nij still enjoy independence, zero mean, and small variance. In 
particular, we will be able to choose N so that A = A + N has computationally 
useful properties, such as sparsity, yet N is sufficiently random for || W, || to be 
small with high probability. 

Example: Set N t j = AA z j with equal probability, independently for all i,j. 

In this example, the random variables N i: j are independent, E[Al,:j] = 0 for 
all i, j, and the standard deviation of equals Aij. On the other hand, the 
matrix A = A + N will have about half as many non-zero entries as A, i.e., it 
will be about twice as sparse. Therefore, while ||A ||2 can be proportional to Vdn, 
the error term || AT|| 2 , i.e., the price for the sparsification, is only proportional to 
Vn. 

The rather innocent example above can be greatly generalized. To simplify 
exposition, in the following, we assume that A^ £ [ — 1 , +1] . 

— Quantization: For all i,j , independently, set A. ij to +1 with probability 
(1 + A^)/ 2, and to —1 with probability (1 — Aij)/ 2. Clearly, for all i,j, we 
have E[iVjj] = E [A^ — A^] = 0, while Vai(Nij) < N/j < 4. 

— Uniform sampling: For any desired fraction p £ (0,1], set Ajj = Aij/p 
with probability p , and 0 otherwise. Now, Var (Nij) = Afj ( 1 — p)/p < 1/p, 
so that the error grows only as 1 /^/p as we retain a p-fraction of all entries. 

— Weighted sampling: For all i,j , independently, set A ^ = A^/pij with 
probability p l j , and 0 otherwise, where pij = pA 2 j ■ This way we retain even 
fewer small entries, while maintaining Var (N^) = 1/p — Afj < 1/p. 

Reducing the number of non-zero entries and their representation length 
causes standard eigenvalue algorithms to work faster. Moreover, the reduced 
memory footprint of the matrix A enables the handling of larger data sets. At a 
high level, we perform data reduction by randomly perturbing each data vector 
so as to simplify its representation, i.e., sparsify and quantize. The point is that 
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the perturbation vectors we use, by virtue of their independence, do not fit in a 
small subspace, acting effectively as “white noise” that is largely filtered out. 

4 Kernel Principal Component Analysis 

Given a collection X of training data x \, . . . , x n £ K d , techniques such as linear 
SVMs and PCA extract features from X by computing linear functions of X. 
However, often the structure present in the training data is not a linear function 
of the data representation. Worse, many data sets do not readily support linear 
operations such as addition and scalar multiplication (text, for example). 

In a “kernel method” the idea is to map X into a space TL equipped with 
inner product. The dimension of TL can be very large, even infinite, and therefore 
it may not be practical (or possible) to work with the mapped data explicitly 
by applying <P : X — > TL . Nevertheless, in many interesting cases it is possible to 
efficiently evaluate the dot products (<P(xi),<P(xj)) via a positive definite kernel 
k for i.e., a function k so that k(xi,Xj) = ($(xi),<l>(xj)). Algorithms whose 
operations can be expressed in terms of inner products can thus operate on 'P(X) 
implicitly, given only the Gram matrix 

I<ij := k(xi,Xj) . 

Given n training data points, the Kernel PCA (KPCA) method [8] begins 
by forming the Gram matrix K above and computing the t largest eigenvalues, 
Ai, . . . , X(, and corresponding eigenvectors, ei, . . . , of K 1 for some appropriate 
choice of i < n. Then, given an input point x, the method computes the value of 
the l nonlinear feature extractors, corresponding to the inner product of the vec- 
tor k(x) = ( k(x , X \ ), k(x , X 2 ), . . . , k{x , x n )) with each of the eigenvectors. These 
feature-values can be used for clustering, classification etc. 

While Kernel PCA is very powerful the matrix K, in general, is dense mak- 
ing the input size scale as n , where n is the number of training points. As 
kernel functions become increasingly more sophisticated, e.g. invoking dynamic 
programming to evaluate the similarity k(xi,Xj) of two strings Xi,Xj , just the 
cost of 0(n 2 ) kernel evaluations to construct K rapidly becomes prohibitive. 

The uniform sparsification and quantization techniques of the previous sec- 
tion are ideally suited for speeding up KPCA. In particular, “sparsification” here 
means that we actually only construct a matrix K by computing k(xi,Xj) for a 
uniformly random subset of all input pairs Xi, Xj and filling in 0 for the remaining 
pairs. In [3], it was proven that as long as K has strong linear structure (which 
is what justifies KPCA in the first place), with high probability, the invariant 
subspaces of K will be very close to those of K. 

Also, akin to quantization, we can replace each exact evaluation of k(xi,Xj ) 
with a more easily computable unbiased estimate for it. In [3], it was shown 
that for kernels where: i) X C R d , and, ii) k(xi,Xj) depends only on ||xj — Xj || 
and/or Xi ■ Xj, one can use random projections, as described in Section 2 for 
this purpose. Note that this covers some of the most popular kernels, e.g., radial 
basis functions (RBF) and polynomial kernels. 
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5 Future Work 

Geometric and spectral properties of random matrices with zero-mean, indepen- 
dent entries are the key ingredients in all three examples we considered [1-3]. 
More general ensembles of random matrices hold great promise for algorithm 
design and call for a random matrix theory motivated from a computational per- 
spective. Two natural directions are the investigation of matrices with limited 
independence, and the development of concentration inequalities for non-linear 
functionals of random matrices. 

We saw that sampling and quantizing matrices can be viewed as injecting 
“noise” into them to endow useful properties such as sparsity and succinctness. 
The distinguishing feature of this viewpoint is that the effect of randomization is 
established without an explicit analysis of the interaction between randomness 
and computation. Instead, matrix norms act as an interface between the two 
domains: (i) matrix perturbation theory asserts that matrices of small spectral 
norm cannot have a large effect in eigencomputations, while (ii) random matrix 
theory asserts that matrices of zero-mean, independent random variables with 
small variance have small spectral norm. Is it possible to extend this style of 
analysis to other machine-learning settings, e.g. Support Vector Machines? 
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There is increasing need to build information systems that protect the privacy 
and ownership of data without impeding the flow of information. We will present 
some of our current work to demonstrate the technical feasibility of building such 
systems: 

Privacy-preserving data mining. The conventional wisdom held that data min- 
ing and privacy were adversaries, and the use of data mining must be restricted 
to protect privacy. Privacy-preserving data mining cleverly finesses this conflict 
by exploting the difference between the level where we care about privacy, i.e. , 
individual data, and the level where we run data mining algorithms, i.e., ag- 
gregated data. User data is randomized such that it is impossible to recover 
anything meaningful at the individual level, while still allowing the data mining 
algorithms to recover aggregate information, build mining models, and provide 
actionable insights. 

Hippocratic databases. Unlike the current systems, Hippocratic databases include 
responsibility for the privacy of data they manage as a founding tenet. Their 
core capabilities have been distilled from the principles behind current privacy 
legislations and guidelines. We identify the technical challenges and problems in 
designing Hippocratic databases, and also outline some solutions. 

Sovereign information sharing. Current information integration approaches are 
based on the assumption that the data in each database can be revealed com- 
pletely to the other databases. Trends such as end-to-end integration, outsourc- 
ing, and security are creating the need for integrating information across au- 
tonomous entities. In such cases, the enterprises do not wish to completely re- 
veal their data. In fact, they would like to reveal minimal information apart 
from the answer to the query. We have formalized the problem, identified key 
operations, and designed algorithms for these operations, thereby enabling a new 
class of applications, including information exchange between security agencies, 
intellectual property licensing, crime prevention, and medical research. 
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Abstract. The next wave in search technology will be driven by the 
identification, extraction, and exploitation of real-world entities repre- 
sented in unstructured textual sources. Search systems will either let 
users express information needs naturally and analyze them more intel- 
ligently, or allow simple enhancements that add more user control on 
the search process. The data model will exploit graph structure where 
available, but not impose structure by fiat. First generation Web search, 
which uses graph information at the macroscopic level of inter-page hy- 
perlinks, will be enhanced to use fine-grained graph models involving 
page regions, tables, sentences, phrases, and real-world-entities. New al- 
gorithms will combine probabilistic evidence from diverse features to 
produce responses that are not URLs or pages, but entities and their 
relationships, or explanations of how multiple entities are related. 



1 Toward More Expressive Search 

Search systems for unstructured textual data have improved enormously since 
the days of boolean queries over title and abstract catalogs in libraries. Web 
search engines index much of the full text from billions of Web pages and serve 
hundreds of millions of users per day. They use rich features extracted from the 
graph structure and markups in hypertext corpora. 

Despite these advances, even the most popular search engines make us feel 
that we are searching with mere strings: we do not find direct expression of 
the entities involved in our information need, leave alone relations that must 
hold between those entities in a proper response. In a plenary talk at the 2004 
World-wide Web Conference, Udi Manber commented: 

If music had been invented ten years ago along with the Web, we would 
all be playing one-string instruments (and not making great music). 

referring to the one-line text boxes in which users type in 1-2 keywords and 
expect perfect gratification with the responses. 

Apart from classical Information Retrieval (IR), several communities are 
coming together in the quest of expressive search, but they are coming from 
very different origins. 

Databases and XML: To be sure, the large gap between the user’s information 
need and the expressed query is well-known. The database community has been 
traditionally uncomfortable with the imprecise nature of queries inherent in IR. 
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The preference for precise semantics has persisted from SQL to XQuery (the 
query language proposed for XML data). The rigor, while useful for system- 
building, has little appeal for the end-user, who will not type SQL, leave alone 
XQuery. 

Two communities are situated somewhere between “uninterpreted” keyword 
search systems and the rigor of database query engines. Various sub-communities 
of natural language processing (NLP) researchers are concerned with NL inter- 
faces to query systems. The other community, which has broad overlaps with 
the NLP community, deals with information extraction (IE). 

NLP: Classical NLP is concerned with annotating grammatical natural lan- 
guage with parts of speech (POS), chunking phrases and clauses, disambiguating 
polysemous words, extracting a syntactic parse, resolving pronoun and other ref- 
erences, analyze roles (eating with a spoon vs. with a friend), prepare a complete 
computer-usable representation of the knowledge embedded in the original text, 
and perform automatic inference with this knowledge representation. Outside 
controlled domains, most of these, especially the latter ones, are very ambitious 
goals. Over the last decade, NLP research has gradually moved toward building 
robust tools for the simpler tasks [19]. 

IE: Relatively simple NLP tasks, such as POS tagging, named entity tagging, 
and word sense disambiguation (WSD) share many techniques from machine 
learning and data mining. Many such tasks model unstructured text as a se- 
quence of tokens generated from a finite state machine, and solve the reverse 
problem: given the output token sequence, estimate the state sequence. E.g., if 
we are interested in extracting dates from text, we can have a positive and a 
negative state, and identify the text spans generated by the positive state. IE is 
commonly set up as a supervised learning problem, which requires training text 
with labeled spans. 

Obviously, to improve the search experience, we need that 

— Users express their information need in some more detail, while minimizing 

additional cognitive burden 

— The system makes intelligent use of said detail, thus rewarding the burden 

the user agrees to undertake 

This new contract will work only if the combination of social engineering and 
technological advances work efficiently in concert. 

2 The New Contract: Query Syntax 

Suitable user interfaces, social engineering, and reward must urge the user to 
express their information need in some more detail. Relevance feedback, offering 
query refinements, and encouraging the user to drill down into response clusters 
are some ways in which systems collect additional information about the user’s 
information need. But there are many situations where direct input from the 
user can be useful. I will discuss two kinds of query augmentation. 

Fragments of types: If the token 2000 appears in grammatical text, cur- 
rent technology can usually disambiguate between the year and some other 
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number, say a money amount. There is no reason why search interfaces can- 
not accept a query with a type hint so as to avoid spurious matches. There is 
also no reason a user cannot look for persons related to SVMs using the query 
PersonType NEAR "SVM", where PersonType is the anticipated response type 
and SVM a word to match. To look for a book in SVMs published around year 
2000, one might type BookType (NEAR "SVM" year~2000). I believe that the 
person composing the query, being the stakeholder in response quality, can be 
encouraged to provide such elementary additional information, provided the re- 
ward is quickly tangible. Moreover, reasonably deep processing power can be 
spent on the query, and this may even be delegated to the client computer. 

Attributes, roles and relations: Beyond annotating query tokens with type 
information, the user may want to express that they are looking for “a business 
that repairs iMacs,” “the transfer bandwidth of USB2.0,” and “papers written 
in 1985 by C. Mohan.” It should be possible to express broad relations between 
entities in the query, possibly the placeholder entity that must be instantiated 
into the answer. The user may constrain the placeholder entity using attributes 
(e.g. MacOS-compliant software), roles and relations (e.g., a student advised by 
X). The challenge will be to support an ever- widening set of attribute types, 
roles and relations while ensuring ongoing isolation and compatibility between 
knowledge bases, features, and algorithms. 

Compared to query syntax and preprocessing, whose success depends largely 
on human factors, we have more to say about executing the internal form of the 
query on a preprocessed corpus. 

3 The New Contract: Corpus and Query Processing 

While modest changes may be possible in users’ query behavior, there is far too 
much inertia to expect content creators to actively assist mediation in the imme- 
diate future. Besides, questions preprocessing can be distributed economically, 
but corpus processing usually cannot. 

The situation calls for relatively light processing of the corpus, at least until 
query time. During large scale use, however, a sizable fraction of the corpus may 
undergo complex processing. It would be desirable but possibly challenging to 
cache the intermediate results in a way that can be reused efficiently. 

3.1 Supervised Entity Extraction 

Information extraction (IE), also called named entity tagging, annotates spans of 
unstructured text with markers for instances of specified types, such as people, 
organizations, places, dates, and quantities. 

A popular framework [11] models the text as a linear sequence of tokens 
being generated from a Markov state machine. A parametric model for state 
transition and symbol emission is learned from labeled training data. Then the 
model is evaluated on test data, and spans of tokens likely to be generated by 
desired states are picked off as extracted entities. 

Generative models such as hidden Markov models (HMMs) have been used 
for IE for a while [7]. If s is the (unknown) sequence of states and x the sequence 
of output features, HMMs seek to optimize the joint likelihood Pr(s,x). 
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In general, x is a sequence of feature vectors. Apart from the tokens them- 
selves, some derived features found beneficial in IE are of the form: Does the 
token 

— Contain a digit, or digits and commas? 

— Contain patterns like DD:DD or DDDD or DD’s where D is a digit? 

— Follow a preposition? 

— Look like a proper noun (as flagged by a part-of-speeclr tagger 1 )? 

— Start with an uppercase letter? 

— Start with an uppercase letter and continue with lowercase letters? 

— Look like an abbreviation (e.g., uppercase letters alternating with periods)? 

The large dimensionality of the feature vectors usually corners us into naive inde- 
pendence assumptions about Pr(s,x), and the large redundancy across features 
then lead to poor estimates of the joint distribution. 

Recent advances in modeling conditional distributions [18] directly optimize 
Pr(s|x), allowing the use of many redundant features without attempting to 
model the distribution over x itself. 



3.2 Linkage Analysis and Alias Resolution 

After the IE step, spans of characters and tokens are marked with type iden- 
tifiers. However, many string spans (called aliases ) may refer to a single en- 
tity (e.g., IBM, International Business Machines , Big Blue, the computer giant 
or www.ibm.com). The variations may be based on abbreviations, pronouns, 
anaphora, hyperlinks and other creative ways to create shared references to en- 
tities. Some of these aliases are syntactically similar to each other but others are 
not. 

In general, detecting aliases from unstructured text, also called coreferent 
resolution, in a complete and correct manner is considered “NLP complete,” 
i.e., requires deep language understanding and vast amounts of world knowl- 
edge. Alias resolution is an active and difficult area of NLP research. In the IE 
community, more tangible success has been achieved within the relatively limited 
scope of record linkage. 

In record linkage, the first IE step results in structured tables of entities, 
each having attributes and relations to other entities. E.g., we may apply IE 
techniques to bibliographies at the end of research papers to populate a table of 
papers, authors, conferences/journals, etc. Multiple rows in each table may refer 
to the same object. Similar problems may arise in Web search involving names 
of people, products, and organizations. 

The goal of record linkage is to partition rows in each table into equivalence 
classes, all rows in a class being references to one real-world entity. Obviously, 
knowing that two different rows in the author table refer to the same person 
(e.g., one may abbreviate the first name) may help us infer that two rows in the 
paper table refer to the same real-world paper. 

A veriety of new techniques are being brought to bear on record linkage [10] 
and coreferent resolution [20], and this is an exciting area of current research. 

1 Many modern part-of-speech taggers are in turn driven by state transition models. 
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3.3 Bootstrapping Ontologies from the Web 

The set of entity types of interest to a search system keeps growing and changing. 
A fixed set of types and entities may not keep up. The system may need to 
actively explore the corpus to propose new types and extract entities for old and 
new types. Eventually, we would like the system to learn how to learn. 

Suppose we want to discover instances of some type of entity (city, say) on 
the Web. We can exploit the massive redundancy of the Web and use some very 
simple patterns [16,8,1,13]: 

“cities” {“,”} “such as” NPList2 
NP1 {“,”} “and other cities” 

“cities” {“,”} “including” NPList2 
NP1 “is a city” 

Here { } denotes an optional pattern and NP is a noun phrase. These patterns 
are fired off as queries to a number of search engines. A set of rules test the 
response Web pages for the existence of valid instantiations of the patterns. A 
rule may look like this: 

NP1 “such as” NPList2 AND 

head(NPl)= “cities” AND 
properNoun(head(eaclr(NPList2))) 

=£- instanceOf(City,head(eaclr(NPList2))) 

KnowItAll [13] makes a probabilistic assessment of the quality of the extrac- 
tion by collecting co-occurrence statistics on the Web of terms carefully chosen 
from the extracted candidates and pre-dehned discriminator phrases. E.g., if X 
is a candidate actor, “X starred in” or “starring X” would be good discriminator 
phrases. KnowItAll uses the point-wise mutual information (PMI) formulation 
by Turney [24] to measure the association between the candidate instance I and 
the discriminator phrase D: PMI(/,D) = |Hits(Z? + /)|/|Hits(/)|. 

Apart from finding instances of types, it is possible to discover subtypes. E.g., 
if we wish to find instances of scientists, and we have a seed set of instances, we 
can discover that physicists and biologists are scientists, make up new patterns 
from the old ones (e.g. “scientist X” to “physicist X”) and improve our harvest 
of new instances. 

In Sections 3.5 and 3.6 we will see how automatic extraction of ontologies 
can assist next-generation search. 

3.4 Searching Relational Data with NL Queries 

In this section and the next (§3.5), we will assume that information extraction 
and alias analysis have led to a reasonably clean entity-relationship (ER) graph. 
The graphs formed by nodes corresponding to authors, papers, conferences and 
journal in DBLP, and actors/actresses, movies, awards, genres, ratings, produc- 
ers and music directors in the Internet Movie Database (IMDB) are examples 
of reasonably clean entity-relationship data graphs. Other real-life examples in- 
volve e-commerce product catalogs and personal information management data, 
with organizations, people, locations, emails, papers, projects, seminars, etc. 

There is a long history of systems that give a natural language interface 
(NLI) to relational engines [4], but, as in general NLP research, recent work has 
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moved from highly engineered solutions to arbitrarily complex problems to less 
knowledge-intensive and more robust solutions for limited domains [21]. E.g., for 
a table JQB(description, platform, company) and the NL query What are the 
HP jobs on a UNIX system?, the translation to SQL might be select distinct 
description from JOB WHERE company = ’HP’ and platform = ’UNIX’. 
The main challenge is to agree on a perimeter of NL questions within which 
an algorithm is required to find a correct translation, and to reliably detect 
when this is not possible. 

3.5 Searching Entity-Relationship Graphs 

NLI systems take advantage of the precise schema information available with 
the “corpus” as well the well-formed nature of the query, even if it is framed 
in uncontrolled natural language. The output of IE systems has less elaborate 
type information, the relations are shallower, and the questions are most often 
a small set of keywords, from users who are used to Web search and do not wish 
to learn about any schema information in framing their queries. 

Free-form keyword search in ER graphs raises many interesting issues, in- 
cluding the query language, the definition of a “document” in response to a 
query, how to score a document which may be distributed in the graph, and how 
to search for these subgraphs efficiently. 

Multiple words in a query may not all match within a single row in a sin- 
gle table, because ER graphs are typically highly normalized using foreign key 
constraints. In an ER version of DBLP, paper titles and author names are in 
different tables, connected by a relation wrote (author, paper). In such cases, 
what is the appropriate unit of response? Recent systems [6,3,17] adopt the view 
that the response should be some minimal graph that connects at least one node 
containing each query keyword. 

Apart from type-free keyword queries, one may look for a single node of a 
specified type (say, a paper) with high proximity to nodes satisfying various 
predicates, e.g., keyword match (“indexing”, “SIGIR”) or conditions on numeric 
fields (year<1995). Resetting random walks [5] are a simple way to answer such 
queries. These techniques are broadly similar to Pagerank [9], except that the 
random surfer teleports only to nodes that satisfy the predicates. Biased random 
walks with restarts are also related to effective conductance in resistive networks. 
In a large ER graph, it is also nontrivial to explain to the user why/lrow enti- 
ties are related; this is important for diagnostics and eliciting user confidence. 
Conductance-based approaches work well [14]: we can connect +1 V to one node, 
ground the other, penalize lrigh-fanout nodes using a grounded sink connected 
to every node, and report subgraphs that conduct the largest current out of the 
source node. 

Recent years have seen an explosion of analysis and search systems for ER 
graphs, and I expect the important issues regarding meaningfulness of results 
and system scalability to be resolved in the next few years. 

3.6 Open-Domain Question Answering 

Finally, the Web at large will continue to be an “open-domain” system where 
comprehensive and accurate entity and relation extraction will remain elusive. 
No schema of entities and relationships can be complete at any time, even if 
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they become more comprehensive over time. Moreover, even a cooperative user 
will not be able to remember and exploit a universal “type system” in asking 
questions. Instead, search systems will provide some basic set of roles [15] that 
apply broadly. Questions will express roles or refinements of roles, and will be 
matched to probabilistic role annotations in the corpus. 

In open-domain QA, question analysis and response scoring will necessarily 
be far more tentative. Some basic machine learning will reveal that the question 
When was television invented? expects the type of the answer (atype) to be a 
date , and that the answer is almost certainly only a few tokens from the word 
television or its synonym. In effect, current technology [22,2,12,23] can translate 
questions into the general form 

find x from corpus where x InstanceOf (Atype (question) ) 
and x RelatedTo GroundConstants (question) 

Here Atype (question) represents the concept of time, and we are looking for a 
reference to an entity x which is an instance of time. (This is where a system like 
KnowItAll comes into play.) In the example above, television or TV would 

be in GroundConstants (question) . 

Checking the predicate RelatedTo is next to impossible in general. QA sys- 
tems employ a variety of approximations. These may be as crude as linear prox- 
imity (the number of of tokens separating x from GroundConstants (question) . 
Linear proximity is already surprisingly effective [23]. More sophisticated sys- 
tems 2 attempt a parse of the question and the passage, and verify that x and 
GroundConstants (question) are related in a way specified by (a parse of) the 
question. As might be expected, there is a trade-off beteen speed and robustness 
on one hand and accuracy and brittleness on the other. 

4 Conclusion 

Many of the pieces required for better searching are coming together. Current 
an upcoming research will introduce synergy as well as build large, robust ap- 
plications. The applications will need to embrace bootstrapping and life-long 
learning better than before. The architecture must isolate feature extraction, 
models, and algorithms for estimation and inferencing. The interplay between 
processing stages makes this goal very challenging. The applications must be 
able to share models and parameters across different tasks and across time. 
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Machine learning and data mining systems have achieved many impressive suc- 
cesses, but to become truly widespread they must be able to work with less help 
from people. This requires automating the data cleaning and integration process, 
handling multiple types of objects and relations at once, and easily incorporat- 
ing domain knowledge. In this talk, I describe how we are pursuing these aims 
using Markov logic networks, a representation that combines first-order logic 
and probabilistic graphical models. Data from multiple sources is integrated by 
automatically learning mappings between the objects and terms in them. Rich 
relational structure is learned using a combination of ILP and statistical tech- 
niques. Knowledge is incorporated by viewing logic statements as soft constraints 
on the models to be learned. Application to a real-world university domain shows 
our approach to be accurate, efficient, and less labor-intensive than traditional 
ones. 

This work, joint with Parag and Matthew Richardson, is described in further 
detail in Richardson and Domingos [1], Richardson and Domingos [2], and Parag 
and Domingos [3]. 
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Abstract. The scientific analysis of data is only around a century old. For most 
of that century, data analysis was the realm of only one discipline - statistics. As 
a consequence of the development of the computer, things have changed dra- 
matically and now there are several such disciplines, including machine learn- 
ing, pattern recognition, and data mining. This paper looks at some of the simi- 
larities and some of the differences between these disciplines, noting where they 
intersect and, perhaps of more interest, where they do not. Particular issues ex- 
amined include the nature of the data with which they are concerned, the role of 
mathematics, differences in the objectives, how the different areas of application 
have led to different aims, and how the different disciplines have led sometimes 
to the same analytic tools being developed, but also sometimes to different tools 
being developed. Some conjectures about likely future developments are given. 



1 Introduction 

This paper gives a personal view of the state of affairs in data analysis. That means 
that inevitably I will be making general statements, so that most of you will be able to 
disagree on some details. But I am trying to paint a broad picture, and I hope that you 
will agree with the overall picture. 

We live in very exciting times. In fact, from the perspective of a professional data 
analyst, I would say we live in the most exciting of times. Not so long ago, analysing 
data was characterised by drudgery, by manual arithmetic, and the need to take great 
care over numerical trivia. Nowadays, all that has been swept aside, with the burden of 
tedium having been taken over by the computer. What we are left with are the high- 
level interpretations and strategic decisions; we look at the summary values derived by 
the computers and make our statements and draw conclusions and base our actions on 
these. It is clear from this that the computer has become the essential tool for data 
analysis. 

But there is more. The computer has not merely swept aside the tedium. The awe- 
some speed of numerical manipulation has permitted the development of entirely new 
kinds of data analytic tools, being applied in entirely new ways, to entirely new kinds, 
and indeed sizes, of data sets. The computer has given us new ways to look at things. 
The old image, that data analysis was the realm of the boring obsessive, is now so 
diametrically opposite to the new truth as to be laughable. 
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This paper describes some of the history, some of the tools, and something of how I 
see the present status of data analysis. So perhaps 1 should begin with a definition. 
Data analysis is the science of discovery hr data, and of processing data to extract 
evidence so that one can make properly informed decisions. In brief, data analysis is 
applied philosophy of science: the theory and methods, not of any particular scientific 
discipline itself, but of how to find things out. 



2 The Evolution of Data Analytic Disciplines 

The origins of data analysis can be traced back as far back as one likes. Think of Ke- 
pler and Gauss analysing astronomical data, of Florence Nightingale using plots to 
demonstrate that soldiers were dying because of poor hygiene rather than military 
action, of Quetelet’s development of ‘social mechanics’, and the fact that world’s 
oldest statistical society, the Royal Statistical Society, was established in 1834. But 
these ‘origins’ really only represent the initial stirrings: it wasn’t until the start of the 
20th century that a proper scientific discipline of data analysis really began to be 
formed. That discipline was statistics, and for the first half of the 20th century statis- 
tics was the only data analytic game in town. Until around 1950, statistics was the 
science of data analysis. (You will have to permit me some poetic leeway in my 
choice of dates: 1960 might be more realistic.) 

Then, around the middle of the 20th century, the computer arrived and a revolution 
began. Statistics began to change rapidly in response to the awesome possibilities the 
computer provided. There is no doubt that, had statistics been born now, at the start of 
the 21st century, rather than 100 years ago at the start of the 20th, it would be a very 
different kind of animal. (Would we have the f-test?.) Moreover, although statistics 
was the intellectual owner of data analysis up until about 1950, it was never the intel- 
lectual owner of data per se, and in the following decades other changes occurred 
which were to challenge the position assumed by statistics. In particular, another dis- 
cipline grew up, whose primary responsibility was, initially, the storage and manipula- 
tion of data. From data manipulation to data analysis was then hardly a large step. 
Statistics was no longer the only player. 

Nowadays, of course, computer science has grown into a vast edifice, and different 
subdisciplines of it have developed as specialised areas of data analysis, all overlap- 
ping with each other and overlapping with their intellectual parent, statistics. These 
subdisciplines include machine learning, pattern recognition, and data mining, and one 
could arguably include image processing, neural networks, and perhaps even computa- 
tional learning theory and other areas also. I cannot avoid remarking that Charles Bab- 
bage, typically regarded as one of the fathers of computing with his analytical engine, 
would have been fascinated by these developments: he was also one of the founders of 
the Royal Statistical Society. Of course, these various data analytic disciplines are not 
carbon copies of each other. They have subtly different aims and emphases, and often 
deal with rather different kinds of data (e.g. in terms of data set size, correlations, 
complexities, etc.). One of my aims in this talk is to examine some of these differ- 
ences. Moreover, if the computer has been the strongest influence leading to the de- 
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velopment of new data analytic technologies, application areas have always been and 
continue to have a similar effect. Thus we have psychometrics, bioinformatics, 
chemometrics, technometrics, and other areas, all addressing the same sorts of prob- 
lems, but in different areas. 1 shall say more about this below. 



3 Data 

I toyed briefly with the idea of calling this talk ‘analysing tomorrow’s data’ since one 
of the striking things about the modern world of data analysis is that the data with 
which we now have to deal could not have been imagined 100 years ago. Then the 
data had to be painstakingly collected by hand since there was no alternative, but 
nowadays much data acquisition is automatic. This has various consequences. 

Firstly, astronomically vast data sets are readily acquired. Books on data mining 
(e.g. [2], [3]), which is that particular data analytic discipline especially concerned with 
analysing large data sets, illustrate the sorts of sizes which are now being encountered. 
The word terabyte is no longer unusual. When I was taught statistical data analysis, I 
was taught that first one must familiarise oneself with one’s data: plot it this way and 
that, look for outliers and anomalies, fit simple models and examine diagnostics. With 
a billion data points (one of the banking data sets I was presented with) this is clearly 
infeasible. Other problems involve huge numbers of variables, and perhaps relatively 
few data points, posing complex theoretical as well as practical questions: bioinfor- 
matics, genomics, and proteomics are important sources of such problems. 

Secondly, one might have thought that automatic data acquisition would mean bet- 
ter data quality, since there would be no subjective human intervention. Unfortunately, 
this has not turned out to be the case. New ways of collecting data has meant new 
ways for the data collection process to go wrong. Worse, large data sets can make it 
more difficult to detect many of the data anomalies. 

Data can be of low quality in many ways: individual values may be distorted or ab- 
sent, entire records may be missing, measurement error may be large, and so on. As 
discussed below, much of statistics is concerned with inference - with making state- 
ments about objects or values not seen or measured, on the basis of those which have 
been. Thus we might want to make statements about other objects from a population, 
or about the future behaviour of objects. Accurate inferences can only be made if one 
has accurate information on how the data were collected. Statisticians have therefore 
predicated their analyses on the assumption that the available observations were drawn 
in well-specified ways, or that the departures from these ways were understood and 
could be modelled. Unfortunately, with many of today’s data sets, such assumptions 
often cannot be made. This has sometimes made statisticians (quite properly) wary of 
analysing such data. But the data still have to be analysed: the questions still need 
answers. This is one reason why data mining has been so successful, at least at first 
glance. Data miners have been prepared to examine distorted data, and to attempt to 
draw conclusions about it. It has to be said, however, that often that willingness has 
arisen from a position of ignorance, rather than one of awareness of the risks that were 
being taken. Relatively few reports of the conclusions extracted from a data mining 
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exercise, for example, qualify those conclusions with a discussion of the possible im- 
pact of selectivity bias on the data being analysed. This is interesting because, almost 
by definition, data mining is secondary data analysis: the analysis of data collected for 
some other purpose. The data may be of perfect quality for its original purpose (e.g. 
calculating your grocery bill in the store), but of poor quality for subsequent mining 
(e.g. because some items were grouped together in the bill). 

A third difference between many modern data analysis problems and those of the 
past is that nowadays they are often dynamic. Electronics permit data to be collected 
as things happen, and this opens the possibility of of making decisions as the data are 
collected. An example is in commercial transactions, where a customer can supply 
information and expects an immediate decision. In such circumstances one does not 
have the luxury of taking the data back to one’s laboratory and analysing it at leisure. 
Speech recognition is another example. This issue has led to new kinds of analytic 
tools, with an emphasis on speed and not merely accuracy. No particular area of data 
analysis seems to have precedence for such problems, but the computer science side, 
perhaps especially machine learning clearly regards such problems as important. 

Although every kind of data analytic discipline must contend with all kinds of data, 
there is no doubt that different kinds are more familiar in different areas. Computa- 
tional areas probably place more emphasis on categorical data than on continuous 
data, and this is reflected in the types of data analytic tools (e.g. methods for extracting 
association rules) which have been developed. 

4 The Role of Mathematics 

Modern statistics is often regarded as a branch of mathematics. This is entirely inap- 
propriate, Indeed, the qualitative change induced by the advent of the computer means 
that statistics could equally be regarded as a branch of computer science. 

In a sense statistics, and data analysis more generally, is the opposite of mathemat- 
ics. Mathematics begins with assumptions about the structure of the universe of dis- 
course (the axioms) and seeks to deduce the consequences. Data analysis, on the other 
hand, begins with observations of the consequences (the data) and seeks to infer some- 
thing about the structure of the universe. One consequence of this is that one can be a 
good mathematician without understanding anything about any area to which the 
mathematics will be applied - one primarily needs facility with mathematical symbol 
manipulation - but one cannot be a good statistician without being able to relate the 
analysis to the world from which the data arose. This is why one hears of mathematics 
prodigies, but never statistics prodigies. Analysis requires understanding. 

There are other differences as well. Nowadays a computer is an essential and indis- 
pensable tool for statistics, but one can still do much mathematics without a computer. 
This is brought home to our undergraduate students, taking mathematics degrees, with 
substantial statistical components, when they come to use software: statistical software 
packages such as Splus, R, SAS, SPSS, Stata, etc., are very different from mathemati- 
cal packages such as Maple and Mathematica. Carrying out even fairly basic statistical 
analyses using the latter can be a non-trivial exercise. 
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David Finney has commented that it is no more true to describe statistics as a 
branch of mathematics than it would be to describe engineering as a branch of mathe- 
matics, and John Nelder has said ‘ The main danger, I believe, in allowing the ethos of 
mathematics to gain too much influence in statistics is that statisticians will be tempted 
into types of abstraction that they believe will be thought respectable by mathemati- 
cians rather than pursuing ideas of value to statistics.’ 

There is no doubt that the misconception of statistics as mathematics has been det- 
rimental in the past, especially in commercial and business applications. Data mining, 
in particular took advantage of this - its very name spells glamour and excitement, the 
possibility of gaining a market edge for free. But there are also other examples where 
the image of statistics slowed its uptake. For example, experimental design (that 
branch of statistics concerned with efficient and cost effective ways to collect data) 
was used in only relatively few sectors (mostly manufacturing). Reformulations of 
experimental design ideas under names such as the Taguchi method and Six Sigma, 
however, have had a big impact. If anything ought to convince my academic col- 
leagues of the power of packaging and presentation, then it should be these examples. 

5 Several Cultures Separated by a Common Language 

The writer George Bernard Shaw once described England and America as ‘two cul- 
tures divided by a common language’, and I sometimes feel that the same applies to 
the various data analytic disciplines. Over the years, I have seen several intense de- 
bates between proponents of the different disciplines. Part of the reason for this lies in 
the different philosophical approaches to investigation. Statistics, perhaps because of 
its mathematical links, places a premium on proof and mathematical demonstration of 
the properties of data analytic tools. For example, demonstrating mathematically that 
an algorithm will always converge. Machine learning, on the other hand, places more 
emphasis on empirical testing. Of course there is overlap. Most methodological statis- 
tics papers include at least one example of the methods applied to real problems, and 
most machine learning papers describe the ideas in mathematical terms, but there is a 
clear difference in what is regarded as of central importance. 

Another reason for the debates has been that many of the ideas were developed in 
parallel, by researchers naturally keen to establish their priority and reputation. This 
led to claims to the effect that ‘we developed it first’ or ‘we demonstrated that prop- 
erty years ago.’ This was certainly evident in the debates on recursive partitioning 
tree classifiers, which were developed in parallel by the machine learning and statis- 
tics communities. 

Misunderstandings can also arise because different schools place emphasis on dif- 
ferent things. Early computer science perspectives on data mining stressed the finding 
of patterns in databases. This is perfectly natural: it is something often required (e.g. 
what percentage of my employees earn more than €x p.a.?). However, this is of lim- 
ited interest to a statistician, who will normally want to make an inference to a wider 
population or to the future (e.g. what percentage of my employees are likely to earn 
more than €x p.a. next year?). Much work on association analysis has ignored this 
inferential aspect. Moreover, much work has also made a false causal assumption: 
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while it is interesting to know that ten times as many people who bought A also 
bought B, it is valuable to know that if people can be induced to buy A they will also 
buy B, and the two are not the same. 

While there have been tensions between the different areas when they develop 
similar models, each from their own perspective, there is no doubt that these tensions 
can be immensely beneficial from a scientific perspective. A nice example of this is 
the work on feedforward neural networks. These originally came from the computer 
(or, one might argue, the cybernetics, electrical engineering, or even biological) side 
of things. The perspective of a set of fairly simple interacting processors dominated. 
Later, however, statisticians became involved and translated the ideas into mathemati- 
cal terms: such models can be written as nested sequences of nonlinear transfor- 
mations of linear combinations of variables. Once written in fairly standard terms, one 
can apply the statistical results of a century of theoretical work. In particular, one 
could explain that the early neural network claims of very substantial improvement in 
predictive power were likely to be in large part due to overfitting the design data, and 
to present ideas and tools for avoiding this problem. Of course, nowadays all these are 
well understood by the neural network community, but this was certainly not the case 
in the early days (I can remember papers presenting absurdly overoptimistic claims), 
even though statisticians had known about the issues for decades. 

If the computer is leading to a unification of the data analytic schools, so also are 
some theoretical developments. The prime examples here, of course, are Bayesian 
ideas. Bayes’s theorem tells us how we should update our knowledge in the light of 
new information. This is the very essence of learning, so it is not surprising that ma- 
chine learning uses these ideas. With the advent of practical computational tools for 
evaluating high dimensional integrals, such as MCMC, statistics has also undergone a 
dramatic Bayesian revolution, not only in terms of dynamic updating models but also 
in terms of model averaging. Indeed, model averaging, like the understanding of over- 
fitting (indeed, closely connected to it), has led to deep theoretical advances. Tools 
such as boosting and bagging are based on these sorts of principles. Boosting, in par- 
ticular, is interesting from our perspective because it illustrates the potential synergy 
which can arise from the disparate emphases of the different disciplines. Originally 
developed by the machine learning community, who proposed it on fairly intuitive 
grounds and showed that it worked in practical applications, it was then explored theo- 
retically by statisticians, who showed its strong links to generalised additive models, a 
well-understood class of statistical tools. The most recent tool to experience this initial 
development, followed by an exposure to the ideas and viewpoints of other data ana- 
lytic disciplines, is that of support vector machines. 

In fact, perceptrons (the progenitor of support vector machines) and logistic dis- 
crimination provide a very nice illustration of the difference in emphasis between, in 
this case, statistical and machine learning models for classification. Logistic discrimi- 
nation fits a model to the probability that an object with given features x will belong to 
class 0 rather than class 1. Typically, the model is fitted by finding the parameters 
which maximise the design set log likelihood: 
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log L oc ^ ]°g p (0 1 x, ) . (1) 

;= l 

Classification is then effected by comparing an estimated probability with a threshold. 
It is immediately clear from (1) that all design set data points contribute - it is really an 
average of contributions. This is fine if one’s model p (0 1 x) has the form of the 

‘true’ function p (0 | x) . But this is a brave assumption. It is likely that the model is 

not perfect. If so, one must question the wisdom of letting data points with estimated 
probability far from the classification threshold contribute the same amount to the fit 
criterion (1) as do those near to it (see [4]). In contrast, perceptron models focus atten- 
tion on whether or not the design set points are correctly classified: quality of fit of a 
model far from the decision surface, which is broadly irrelevant to classification per- 
formance, does not come into it. 

An example of another area which has been developed in rather different ways by 
different disciplines is the area I call pattern discovery. This is the search for, identifi- 
cation of, and description of anomalously high local densities of data points. The com- 
puter science literature has focused on algorithms for finding such configurations. In 
particular, a great deal of work has occurred when the data are character strings, in, 
especially text search (e.g. web search engines) and nucleotide sequences. In contrast, 
the statistical work has concentrated on the inference problem, developing scan statis- 
tics for deciding whether a local anomaly represents a real underlying structure is just 
random variation of a background model. Ideas of this kind have been developed in 
many application areas, including bioinformatics, technical stock chart analysis, as- 
tronomy, market basket analysis, and others, but the realisation that they are all tack- 
ling very similar problems appears to be only recent. 

Implicit in the last two paragraphs is one of the fundamental differences in empha- 
sis between computational and statistical approaches to data analysis - again an under- 
standable difference in view of their origins. This is the emphasis of the computational 
approaches on algorithms (e.g. the perceptron error-correcting algorithm) and the 
emphasis of the statistical approaches on models (e.g. the logistic discrimination 
model). Both algorithms and models are, of course, important when tackling real prob- 
lems. 

It is my own personal view that one can also characterise the difference between 
the two perspectives, at least to some extent, in terms of degree of risk. The computa- 
tional schools seem often prepared to try something without the assurance that it will 
work, or that it will always work, but in the hope (or knowledge from previous analy- 
ses) that it will sometimes work. The statistical schools seem more risk averse, requir- 
ing more assurance before carrying out an analysis. Perhaps this is illustrated by the 
approaches to pattern discovery mentioned above: the data mining community devel- 
ops algorithms with which to detect possible patterns, while the statistical community 
develops tools to tell whether they are real or merely chance. Once again, both per- 
spectives are valuable, especially in tandem: adventurous risk-taking offers the possi- 
bility of major breakthroughs, while careful analysis shows one that the method gives 
reliable results. 
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6 Future Tools and Application Areas? 

Of course, the various data analytic disciplines are constantly evolving. We live in 
very exciting times because of the tools which have been developed over the past few 
decades, but that development has not stopped. If anything, it has accelerated and will 
continue to do so as the computational infrastructure continues to develop. This means 
faster and larger (in terms of all dimensions of datasets). Judging from the past, this 
will translate into analytic tools about which one previously could only have dreamt, 
and, further, into tools one could not even have imagined. 

If the computer is one force driving the development of new data analytic tools, I 
can see at least two others. 

The first of these are application areas, mentioned above. Certainly, the growth of 
statistics over the 20th century was strongly directed by the applications. Thus agricul- 
tural requirements led to the early development of experimental design, psychology 
motivated the development of factor analysis and other latent variable models, medi- 
cine led to survival analysis, and so on. In other areas, speech recognition stimulated 
work on hidden Markov models, robotics stimulated work on reinforcement learning, 
etc. Of course, once developments have been started, and the power of the tools being 
developed has been recognised, other application areas rapidly adopt the tools. 

As with the impact of developing computational infrastructure, I see no reason to 
expect this influence of application areas to stop. We are currently witnessing the 
particular requirements of genomic, proteomic, and related data leading to the devel- 
opment of new analytic tools; for example, methods for handling fat data - data in- 
volving many (perhaps tens of thousands of) variables, but few (perhaps a few tens of) 
data points. Mathematical finance is likewise an area which is shifting its centre of 
gravity towards analysis. Until recently characterised by mathematical areas such as 
stochastic calculus, it is increasingly recognised that data analysis is also needed - the 
values of the model parameters must come from somewhere. More generally, the area 
of personal finance is beginning to provide a rich source of novel problems, requiring 
novel solutions. The world wide web, of course, is another source of new types of 
data, and new problems. This area, in particular, is a source of data which is character- 
ised by its dynamic properties, and I expect the analysis of dynamic data to play an 
even more crucial role in future developments. Decisions in telecoms systems, even in 
day-to-day purchasing transactions, are needed now , not after a leisurely three months’ 
analysis of a customer’s track record and characteristics. Delay loses business. 

The second additional driving force I can see is also not really a new one. It has al- 
ways been with us, but it will lead to the development of new kinds of tools, in re- 
sponse to new demands and also enabled by the advancing computational infrastruc- 
ture, This is the need to model finer and finer aspects of the presenting problems. A 
recent example of this is in the analysis of repeated measures data. The last two dec- 
ades have witnessed a very exciting efflorescence of ideas for tackling such data. The 
essential problem is to recognise and take account of the fact that repeated measure- 
ments data are likely to be correlated (with the (multiple) series being too short to use 
time series ideas). Classical assumptions of independence are all very well, but more 
accurate models and predictions result when the dependence is modelled. Another 
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example of such ‘finer aspects’ of the presenting problem, which has typically been 
ignored up until now, is the fact that predictive models are likely to be applied to data 
drawn from distributions different from that from which the design data were drawn 
(perhaps a case for dynamic models). There are many other examples. 

There is, however, a cautionary comment to be made in connection with this driv- 
ing force. It is easy to go too far. There is little point is developing a method to cope 
with some aspect of the data if the inaccuracies induced by that aspect are trivial in 
comparison with those arising from other causes. Data analysis is not a merely 
mathematical exercise of data manipulation. 

If we data analysts live in exciting times, I think it is clear that the future will be 
even more exciting. Looking back on the past it is obvious that the tensions between 
the different data analytic disciplines have, in the end, been beneficial: we can learn 
from the perspectives and emphases of the other approaches. In particular, we should 
learn that the other disciplines can almost certainly shed light on and help each of us 
gain greater understanding of what we are trying to do. We should look for the syner- 
gies, not the antagonisms. 

I’d like to conclude with two quotations. The first is from lohn Chambers, the 
computational statistician who developed Splus and who won the 1998 ACM Software 
System Award for that work. He wrote: ‘Greater statistics can be defined simply, if 
loosely, as everything related to learning from data, from the first planning or collec- 
tion to the last presentation or report. Lesser statistics is the body of specifically statis- 
tical methodology that has evolved within the profession - roughly, statistics as defined 
by texts, journals, and doctoral dissertations. Greater statistics tends to be inclusive, 
eclectic with respect to methodology, closely associated with other disciplines, and 
practiced by many outside of academia and often outside of professional statistics. 
Lesser statistics tends to be exclusive, oriented to mathematical techniques, less fre- 
quently collaborative with other disciplines, and primarily practiced by members of 
university departments of statistics. ’ [1] 

John has called the discipline of data analysis ‘greater statistics’, but I am sure we 
can all recognise what we do in his description. What we call it is not important. As 
Juliet puts it in Act II, Scene ii of Shakespeare’s Romeo and Juliet. 

'What’s in a name? that which we call a rose 
By any other name would smell as sweet. ' 
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Abstract. Reinforcement learning (RL) algorithms attempt to assign 
the credit for rewards to the actions that contributed to the reward. 
Thus far, credit assignment has been done in one of two ways: uniformly, 
or using a discounting model that assigns exponentially more credit to 
recent actions. This paper demonstrates an alternative approach to tem- 
poral credit assignment, taking advantage of exact or approximate prior 
information about correct credit assignment. Infinite impulse response 
(HR) filters are used to model credit assignment information. IIR filters 
generalise exponentially discounting eligibility traces to arbitrary credit 
assignment models. This approach can be applied to any RL algorithm 
that employs an eligibility trace. The use of IIR credit assignment filters 
is explored using both the G POM DP policy-gradient algorithm and the 
Sarsa(A) temporal-difference algorithm. A drop in bias and variance of 
value or gradient estimates is demonstrated, resulting in faster conver- 
gence to better policies. 



1 Introduction 

A reinforcement learning (RL) agent performs actions in a world according to a 
parameterised policy. The agent seeks to adjust the policy in order to maximise 
a long-term measure of reward. A core difficulty is correctly distinguishing which 
actions caused high long-term pay offs. This temporal credit assignment problem 
is usually solved in one of two ways: (1) if the task ends in finite time then credit 
can sometimes be assigned uniformly; (2) if the task has an infinite horizon a 
discounted model assigns exponentially more credit to recent actions. 

This paper shows how to use exact or approximate prior information about 
reward delays to build a tailored credit assignment model. Taking advantage of 
such information leads to a reduction in the bias and variance of estimates; either 
value estimates or gradient estimates depending on the algorithm employed. We 
demonstrate that the drop in bias and variance can be orders of magnitude given 
good prior information about the correct way to assign credit to actions. 

For example, the action of adding chemicals to a bio-reactor, or the injection 
of a drug into a patient, will not have an immediate impact. Rather, the response 
of the system to the input ramps up with time to a maximum, then decays slowly. 
In this case exponential credit assignment is inappropriate because the majority 
of the credit goes to the most recent input (or action), rather than the input 
that is currently dominating the system response. 
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